(4) Statement of Problems Studied:

Investigations  were  carried  out  on  Markov  dependence  in  statistics  and  information  theory,  and 
statistical  problems  in  physical  mapping.  Results  were  obtained  on  the  Minimum  Description 
Length  Principle  and  statistical  inference,  adaptive  quantization  in  image  compression,  Markov 
chain  Monte  Carlo  methods,  and  statistical  problems  in  the  Human  Genome  Project.  These  results 
shed  light  on  the  connections  between  information  theory  and  statistics,  on  the  role  of  parametric 
models  in  quantization  and  image  compression,  on  understanding  the  convergence  behaviors  of 
Markov chain Monte Carlo samplers, and on the information needed for a clone map of
chromosomes. Furthermore, a wavelet image coder was designed as part of the investigation, and it
gives excellent performance on test images.

(5) Summary of the Most Important Results

For many years, the PI has been intrigued by the interplay between information/communication
theory and statistics. In the period supported by the grant, the PI worked on the Minimum Description
Length (MDL) Principle of Rissanen, focusing on understanding MDL in comparison to more
conventional  statistical  methods.  Minimax  lower  bounds  for  smooth  continuous  Markov  source 
classes  are  technically  much  more  challenging  to  obtain  than  in  the  iid  case.  Nevertheless,  Yu 
(1994) gives a minimax redundancy lower bound in the Markov case via a recursive equation.
Based  on  mutual  information  calculations,  Yu  (1996)  derives  minimax  redundancy  lower  bounds 
for  smooth  density  classes  and  hence  unifies  the  minimax  approaches  to  redundancy  lower  bounds 
in  the  parametric  and  nonparametric  cases.  Yu  (1997)  explores  connections  between  important 
inequalities  in  statistics  and  information  theory.  Rissanen  and  Yu  (1995)  study  MDL  in  the 
computational  learning  context.  Barron,  Yang,  and  Yu  (1994)  show  for  the  first  time  that  the  extra 
log(n) factor is not necessary for MDL-based density estimators. In particular, they construct an
optimal-rate MDL-based histogram estimator that takes into account the Lipschitz condition in
the coding of histogram parameters. Barron, Rissanen and Yu (1998) is an invited paper for the
50th anniversary of Information Theory, to appear in a special issue of IEEE Transactions on
Information Theory. It reviews the theoretical developments of MDL and makes connections
among different formulations of the redundancy problem in coding, which motivates and validates
MDL. It also makes explicit connections between statistical estimation and lossless data
compression.
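
To make the link between coding and density estimation concrete, the following is a minimal
sketch of a textbook two-part MDL criterion for choosing the number of bins of an equal-width
histogram on [0, 1). It is not the estimator of Barron, Yang, and Yu (1994): it charges the
standard (1/2) log n per free parameter and does not use their Lipschitz-aware coding of the
histogram parameters. The function names, the bin range, and the synthetic sample are all
assumptions made for illustration.

```python
import math

def histogram_code_length(data, m):
    """Two-part code length (in nats) of an m-bin equal-width histogram on [0, 1)."""
    n = len(data)
    counts = [0] * m
    for x in data:
        counts[min(int(x * m), m - 1)] += 1
    # Data cost: negative log-likelihood under the fitted histogram density,
    # whose height in bin j is m * counts[j] / n.
    data_cost = -sum(c * math.log(m * c / n) for c in counts if c > 0)
    # Parameter cost (textbook assumption): (1/2) log n for each of the
    # m - 1 free bin probabilities.
    parameter_cost = 0.5 * (m - 1) * math.log(n)
    return data_cost + parameter_cost

def mdl_bin_count(data, max_bins=50):
    """Number of bins in 1..max_bins with the shortest total code length."""
    return min(range(1, max_bins + 1),
               key=lambda m: histogram_code_length(data, m))

# Hypothetical sample: an evenly spaced grid mimicking a step density that is
# 1.5 on [0, 0.5) and 0.5 on [0.5, 1).
left = [(i + 0.5) / 1500 for i in range(750)]
right = [0.5 + (i + 0.5) / 500 for i in range(250)]
sample = left + right
print("selected number of bins:", mdl_bin_count(sample))
```

The criterion trades the likelihood gain from finer bins against the cost of describing more
bin heights, which is the sense in which a shorter total description corresponds to a better
estimate.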

Markov  Chain  Monte  Carlo  (MCMC)  methods  have  attracted  much  attention  from  researchers  as  an 
important  computational  tool  for  a  variety  of  applications  including  likelihood  computation  in 
frequentist  statistics  and  posterior  computation  in  Bayesian  statistics.  Because  the  target 
distribution  is  the  stationary  distribution  for  the  constructed  Markov  chain,  the  success  of  the 
MCMC method relies crucially on our ability to assess the convergence of the chain to its
equilibrium. The PI's interest in MCMC has been in output error
assessment  and  convergence  diagnostics  issues.  Mykland,  Tierney  and  Yu  (1995)  propose  to  use 
the  split-chain  idea  of  Nummelin,  Athreya  and  Ney.  In  this  way,  known  results  for  regenerative 
simulations apply and more reliable estimates of the variance of the sample mean can be obtained
from  the  simulated  chain.  In  the  discussion  (Yu,  1995)  on  the  Besag  et  al.  paper  in  Statistical 
Science  (cf.  Yu  and  Mykland,  1998),  the  cusum  plot  is  proposed  as  a  simple  diagnostic  tool  based 
on  a  one-dimensional  summary  of  the  MCMC  sample.  Strong  approximation  results  for  absolutely 
regular  sequences  are  used  to  argue  that  the  smoothness  of  the  line-joined  cusum  path  reflects  the 
mixing speed of the one-dimensional summary statistic: the faster the mixing, the more "hairy" the
cusum  plot.  Ostland  and  Yu  (1997)  present  a  manually-adaptive  extension  of  Quasi  Monte  Carlo 
(QMC)  methods  for  approximating  marginal  densities  as  a  viable  alternative  to  the  Metropolis 
algorithm when the joint density is known up to a normalization constant. Randomization and a
batch-wise approach involving (0,s)-sequences are the cornerstones of our method. By
incorporating a variety of graphical diagnostics, the method allows the user to adaptively allocate
points  for  joint  density  function  evaluations  and  therefore  produces  reliable  marginal  density 
approximations  in  moderate  dimensions. 
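
The cusum diagnostic described above can be sketched in a few lines. The code below computes the
cusum path of a one-dimensional summary and, as a rough numerical stand-in for how "hairy" the
path looks, the fraction of interior points at which the path changes direction. That
turning-point fraction is an assumption introduced here for illustration and is not taken from
the cited papers; the Gaussian AR(1) chain is likewise only a hypothetical substitute for real
sampler output, with a correlation near 1 playing the role of slow mixing.

```python
import random

def cusum_path(values):
    """Partial sums of deviations from the overall sample mean."""
    mean = sum(values) / len(values)
    path, total = [], 0.0
    for v in values:
        total += v - mean
        path.append(total)
    return path

def turning_point_fraction(path):
    """Fraction of interior points where the path changes direction."""
    turns = sum(
        1 for a, b, c in zip(path, path[1:], path[2:])
        if (b - a) * (c - b) < 0
    )
    return turns / (len(path) - 2)

def ar1_chain(n, rho, rng):
    """Gaussian AR(1) chain started at 0 (hypothetical stand-in for MCMC output)."""
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

rng = random.Random(0)
fast = turning_point_fraction(cusum_path(ar1_chain(5000, 0.0, rng)))
slow = turning_point_fraction(cusum_path(ar1_chain(5000, 0.99, rng)))
print(f"fast-mixing chain: {fast:.3f}  slow-mixing chain: {slow:.3f}")
```

A chain with little serial dependence crosses its mean often, so its cusum path changes
direction frequently; a strongly dependent chain produces long smooth excursions and a much
smaller fraction.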

Much  of  the  recent  work  on  wavelet  subband  image  coding  has  relied  on  adaptive  quantization  to 
achieve substantial gains over more traditional techniques, for example those based on the
Discrete  Cosine  Transform.  In  a  typical  example,  different  quantizers  are  used  for  different 
subbands,  or  for  blocks  within  a  given  subband,  and  the  quantizers  are  explicitly  sent  to  the 
decoder  as  overhead.  Then,  the  quantized  coefficients  are  transmitted  using  adaptive  entropy 
coding, typically through arithmetic coding. In this example, forward adaptation is used for the
quantizers  and  backward  adaptation  for  the  entropy  coding.  Yoo,  Ortega,  and  Yu  (1997)  show  that 
a  combination  of  forward  and  backward  adaptation  methods  can  be  used  to  update  the  quantizers  as 
well.  Specifically,  we  propose  an  algorithm  where  classification  can  be  done  based  on  the 
quantized  past  data  and  where  the  quantizer  to  be  used  within  one  class  can  itself  be  adapted  on  the 
fly. Our proposed algorithm gives very competitive performance on standard test images (in terms
of signal-to-noise ratio under the mean squared error distortion measure).
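
The idea of backward adaptation, in which the quantizer is updated from already-quantized data
so that no side information has to be sent, can be illustrated with a small sketch. The code
below uses a Jayant-style step-size rule for a uniform scalar quantizer; this rule is a
stand-in chosen for illustration and is not the algorithm of Yoo, Ortega, and Yu (1997), which
also involves classification. All function names and constants (shrink, expand, the number of
levels) are assumptions.

```python
def adapt_step(step, index, max_index, shrink=0.8, expand=1.6,
               min_step=1e-3, max_step=1e3):
    """Update the step size using only the previously transmitted index."""
    if abs(index) == max_index:      # outermost level hit: quantizer too fine
        step *= expand
    elif index == 0:                 # innermost level hit: quantizer too coarse
        step *= shrink
    return min(max(step, min_step), max_step)

def encode(samples, step=1.0, max_index=3):
    """Quantize each sample to an index in [-max_index, max_index]."""
    indices = []
    for x in samples:
        index = max(-max_index, min(max_index, round(x / step)))
        indices.append(index)
        step = adapt_step(step, index, max_index)
    return indices

def decode(indices, step=1.0, max_index=3):
    """Rebuild the samples; the step size is re-derived from the indices alone."""
    values = []
    for index in indices:
        values.append(index * step)
        step = adapt_step(step, index, max_index)
    return values

# Hypothetical signal whose scale jumps partway through.
signal = [0.2, -0.1, 0.3, 12.0, -9.0, 11.0, 10.0, -12.0, 0.1, 0.0]
indices = encode(signal)
print(indices)
print([round(v, 3) for v in decode(indices)])
```

Because the encoder and decoder apply the same update to the same indices, their step sizes
stay synchronized without any overhead bits, at the cost of a short lag when the signal
statistics change.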

The  first  step  in  the  Human  Genome  Project  is  to  assemble  DNA  fragments,  called  clones,  to  form 
clone  maps,  which  allow  the  detailed  study  of  chromosomal  regions  of  biological  interest.  Yu  and 
Speed  (1997)  answer  biologist  Lehrach's  question  about  how  much  information  is  needed  to 
complete  a  clone  map  by  formulating  a  number  of  different  notions  (or  configuration  variables)  of  a 
clone  map.  The  entropy  of  each  notion  (or  configuration  variable)  is  tightly  bounded.  The  results 
are useful for planning future mapping efforts. In particular, based on the bounds in Yu and Speed
(1997), comparisons are made for four "model" species in terms of information needed for the
mapping of their respective cosmid clone libraries. It follows that the cosmid clone mapping for the
roundworm requires about 40 times as much information as that for the bacterium E. coli, and that
such mapping for humans requires about 1,500 times as much information as that for the bacterium
E. coli. Another interesting fact which follows from the entropy bounds is that a variable relating to
the  pairwise  approach  to  clone  mapping  contains  a  substantial  proportion  (more  than  20%)  of  the 
information  in  a  full  configuration  variable. 

Various  random  fingerprinting  methods  are  sometimes  used  to  detect  the  overlap  between  pairs  of 
clones  as  a  first  step  towards  producing  a  minimal  tiling  path  of  clones  for  subsequent  mapping 
and  sequencing  efforts.  Nelson,  Speed  and  Yu  (1997)  analyze  the  overlap  detection  problem  for 
two  clones.  They  evaluate  and  compare  various  statistical  procedures  for  the  two-clone  overlap 
based on random fingerprinting data. In particular, they quantify the limitations of random
fingerprinting as a way to detect pairwise overlap and identify, within those limitations, the most
effective ways to use the data. Based on the results, it is concluded that random-fingerprinting-based
methods generate very weak overlap detectors, confirming what biologists are discovering by trial
and error.